Similarity Group-by Operators for Multi-dimensional Relational Data (Extended Abstract)
Abstract
The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from at least one other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in dataset, demonstrates that the proposed algorithms can achieve up to three orders of magnitude better performance than baseline methods developed to solve the same problem.
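The two grouping criteria and the three overlap semantics described in the abstract can be illustrated with a small sketch. This is not the paper's PostgreSQL implementation or its algorithms: it is a naive quadratic-time Python illustration that assumes Euclidean distance and processes tuples in input order. The function names `sgb_any` and `sgb_all`, the parameter `eps`, and the option strings `"join_any"`, `"eliminate"`, and `"new_group"` are hypothetical names chosen for this example.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def sgb_any(points, eps):
    """Distance-to-any sketch: groups are connected components of the
    graph that links two points whenever they are within eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def sgb_all(points, eps, on_overlap="join_any"):
    """Clique (distance-to-all) sketch: a point joins a group only if it is
    within eps of every member. Assumed overlap handling, in input order:
    'join_any' picks the first fitting group, 'eliminate' drops the point,
    'new_group' starts a new group for it."""
    groups = []
    for p in points:
        fits = [g for g in groups if all(dist(p, q) <= eps for q in g)]
        if len(fits) == 1 or (fits and on_overlap == "join_any"):
            fits[0].append(p)
        elif not fits or on_overlap == "new_group":
            groups.append([p])
        elif on_overlap != "eliminate":
            raise ValueError(f"unknown on_overlap: {on_overlap!r}")
        # on_overlap == "eliminate": the overlapping point is dropped
    return groups
```

With `eps = 1.5`, the points `(0, 0)`, `(1, 0)`, `(2, 0)` form a single group under `sgb_any`, because each is within `eps` of a neighbor, but `sgb_all` separates `(2, 0)` from `(0, 0)`, since those two are 2 units apart. Note that the result of this clique sketch depends on the order in which the points arrive.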
Related resources
On Order-independent Semantics of the Similarity Group-By Relational Database Operator
Similarity group-by (SGB, for short) has been proposed as a relational database operator to match the needs of emerging database applications. Many SGB operators that extend SQL have been proposed in the literature, e.g., similarity operators in the one-dimensional space. These operators have various semantics. Depending on how these operators are implemented, some of the implementations may le...
The similarity-aware relational database set operators
Identifying similarities in large datasets is an essential operation in several applications such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced in database engines for relational operators such as joins and group-by, ...
The Similarity-Aware Relational Intersect Database Operator
Identifying similarities in large datasets is an essential operation in many applications such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators have been proposed that introduce similarity processing at the database engine level, ...
Identifying Algebraic Properties to Support Optimization of Unary Similarity Queries
Conventional operators for data retrieval are either based on exact matching or on total order relationship among elements. Neither of them is appropriate to manage complex data, such as multimedia data, time series and genetic sequences. In fact, the most meaningful way to compare complex data is by similarity. However, the Relational Algebra, employed in the Relational Database Management Sys...
A Class of LULU Operators on Multi-Dimensional Arrays
The LULU operators for sequences are extended to multi-dimensional arrays via the morphological concept of connection in a way which preserves their essential properties, e.g. they are separators and form a four element fully ordered semi-group. The power of the operators is demonstrated by deriving a total variation preserving discrete pulse decomposition of images.